We propose a fairness-aware learning framework that mitigates intersectional subgroup bias associated with protected attributes. Prior research has primarily focused on mitigating one kind of bias by incorporating complex fairness-driven constraints into optimization objectives or designing additional layers that focus on specific protected attributes. We introduce a simple and generic bias mitigation approach that prevents models from learning relationships between protected attributes and the output variable by reducing the mutual information between them. We demonstrate that our approach is effective in reducing bias with little or no drop in accuracy. We also show that models trained with our learning framework become causally fair and insensitive to the values of protected attributes. Finally, we validate our approach by studying feature interactions between protected and non-protected attributes. We demonstrate that these interactions are significantly reduced when applying our bias mitigation.
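To make the quantity in the abstract above concrete, the following is a minimal sketch of how the empirical mutual information between a discrete protected attribute and a model's discrete predictions could be measured. It is not the authors' training procedure; the function name `mutual_information` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: a binary protected attribute and two sets of predictions.
protected = [0, 0, 1, 1]
biased_preds = [0, 0, 1, 1]   # predictions fully determined by the attribute
fair_preds = [0, 1, 0, 1]     # predictions independent of the attribute

mi_biased = mutual_information(protected, biased_preds)
mi_fair = mutual_information(protected, fair_preds)
```

In this toy setting the biased predictions carry log 2 nats of information about the attribute, while the independent predictions carry none; a mitigation method in the spirit of the abstract would push a model toward the second case.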
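Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.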
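Internet of unmanned aerial vehicle (I-UAV) networks promise to accomplish sensing and transmission tasks quickly, robustly, and cost-efficiently via effective cooperation among UAVs. To achieve these promising benefits, the crucial I-UAV networking issue should be tackled. This article argues that I-UAV networking can be classified into three categories: quality-of-service (QoS) driven networking, quality-of-experience (QoE) driven networking, and situation-aware networking. Each category of networking poses emerging challenges that severely affect the safe and efficient accomplishment of I-UAV missions. This article analyzes these challenges in detail and expounds on the corresponding intelligent approaches to tackle the I-UAV networking issue. Besides, considering the uplifting effect of extending the scalability of I-UAV networks through cooperation with high altitude platforms (HAPs), this article gives an overview of integrated HAP and I-UAV networks and presents the corresponding networking challenges and intelligent approaches.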
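As recommendation is essentially a comparative (or ranking) process, a good explanation should illustrate to users why an item is believed to be better than another, i.e., provide comparative explanations about the recommended items. Ideally, after reading the explanations, a user should reach the same ranking of items as the system's. Unfortunately, little research attention has yet been paid to such comparative explanations. In this work, we develop an extract-and-refine architecture to explain the relative comparisons among a set of ranked items from a recommender system. For each recommended item, we first extract one sentence from its associated reviews that best suits the desired comparison against a set of reference items. This extracted sentence is then further articulated with respect to the target user through a generative model to better explain why the item is recommended. We design a new explanation quality metric based on BLEU to guide the end-to-end training of the extraction and refinement components, which avoids generation of generic content. Extensive offline evaluations on two large recommendation benchmark datasets and serious user studies against an array of state-of-the-art explainable recommendation algorithms demonstrate the necessity of comparative explanations and the effectiveness of our solution.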
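When solving two-player zero-sum games, multi-agent reinforcement learning (MARL) algorithms often create populations of agents where, at each iteration, a new agent is discovered as the best response to a mixture over the opponent population. Within such a process, the update rules of "who to compete with" (i.e., the opponent mixture) and "how to beat them" (i.e., finding best responses) are underpinned by manually developed game-theoretic principles such as fictitious play and Double Oracle. In this paper, we introduce a novel framework, Neural Auto-Curricula (NAC), that leverages meta-gradient descent to automate the discovery of the learning update rule without explicit human design. Specifically, we parameterise the opponent selection module by neural networks and the best-response module by optimisation subroutines, and update their parameters solely via interaction with the game engine, where the players aim to minimise their exploitability. Surprisingly, even without human design, the discovered MARL algorithms achieve competitive or even better performance compared with state-of-the-art population-based game solvers on Games of Skill, differentiable Lotto, non-transitive Mixture Games, Iterated Matching Pennies, and Kuhn Poker. Additionally, we show that NAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO on Leduc Poker. Our work inspires a promising future direction to discover general MARL algorithms solely from data.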
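The exploitability objective mentioned in the abstract above can be illustrated on a small matrix game. The sketch below uses the common definition for two-player zero-sum games (the sum of both players' best-response gains) and is not taken from the paper; the function name `exploitability` and the matching-pennies payoff matrix are assumptions made for illustration.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains in a zero-sum matrix game.

    `payoff[i][j]` is the row player's payoff; the column player receives its negative.
    """
    # Payoff to the row player of each pure row action against the column mixture.
    row_values = [
        sum(a * q for a, q in zip(row, col_strategy)) for row in payoff
    ]
    # Payoff to the row player when the column player uses each pure column action.
    n_cols = len(payoff[0])
    col_values = [
        sum(p * payoff[i][j] for i, p in enumerate(row_strategy))
        for j in range(n_cols)
    ]
    # Row best response maximises, column best response minimises the row payoff.
    return max(row_values) - min(col_values)


# Matching pennies: the row player wins when the two choices match.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]

at_equilibrium = exploitability(matching_pennies, [0.5, 0.5], [0.5, 0.5])
off_equilibrium = exploitability(matching_pennies, [1.0, 0.0], [0.5, 0.5])
```

The uniform strategy pair is the Nash equilibrium of matching pennies and has zero exploitability, whereas a row player who always plays the first action can be exploited by the column player.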
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
We demonstrate how efficient autonomous drone swarms can be in detecting and tracking occluded targets in densely forested areas, such as lost people during search and rescue missions. Exploration and optimization of local viewing conditions, such as occlusion density and target view obliqueness, provide much faster and much more reliable results than previous, blind sampling strategies that are based on pre-defined waypoints. An adapted real-time particle swarm optimization and a new objective function are presented that are able to deal with dynamic and highly random through-foliage conditions. Synthetic aperture sensing is our fundamental sampling principle, and drone swarms are employed to approximate the optical signals of extremely wide and adaptable airborne lenses.
Many problems involve the use of models which learn probability distributions or incorporate randomness in some way. In such problems, because computing the true expected gradient may be intractable, a gradient estimator is used to update the model parameters. When the model parameters directly affect a probability distribution, the gradient estimator will involve score function terms. This paper studies baselines, a variance reduction technique for score functions. Motivated primarily by reinforcement learning, we derive for the first time an expression for the optimal state-dependent baseline, the baseline which results in a gradient estimator with minimum variance. Although we show that there exist examples where the optimal baseline may be arbitrarily better than a value function baseline, we find that the value function baseline usually performs similarly to an optimal baseline in terms of variance reduction. Moreover, the value function can also be used for bootstrapping estimators of the return, leading to additional variance reduction. Our results give new insight and justification for why value function baselines and the generalized advantage estimator (GAE) work well in practice.
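As a small illustration of why baselines reduce variance without biasing a score-function estimator, the sketch below works through the classic one-step case: a two-action policy with a sigmoid-parameterised probability and deterministic rewards. It does not reproduce the paper's state-dependent expression; it uses the standard single-step variance-minimising baseline, and the names `estimator_stats` and `optimal_baseline` as well as the numbers are hypothetical.

```python
def estimator_stats(p, rewards, baseline):
    """Exact mean and variance of g(a) = (r[a] - b) * d/dtheta log pi(a).

    The policy picks action 1 with probability p = sigmoid(theta), action 0 otherwise.
    """
    probs = [1.0 - p, p]
    scores = [-p, 1.0 - p]  # d/dtheta log pi(a) for a = 0 and a = 1
    samples = [(rewards[a] - baseline) * scores[a] for a in (0, 1)]
    mean = sum(q * g for q, g in zip(probs, samples))
    var = sum(q * (g - mean) ** 2 for q, g in zip(probs, samples))
    return mean, var


def optimal_baseline(p, rewards):
    """Variance-minimising constant baseline: E[score^2 * r] / E[score^2]."""
    probs = [1.0 - p, p]
    scores = [-p, 1.0 - p]
    numerator = sum(q * s * s * r for q, s, r in zip(probs, scores, rewards))
    denominator = sum(q * s * s for q, s in zip(probs, scores))
    return numerator / denominator


p = 0.8                 # assumed probability of action 1
rewards = [1.0, 3.0]    # assumed deterministic rewards for actions 0 and 1
value = (1.0 - p) * rewards[0] + p * rewards[1]   # value-function baseline

mean_none, var_none = estimator_stats(p, rewards, 0.0)
mean_value, var_value = estimator_stats(p, rewards, value)
mean_opt, var_opt = estimator_stats(p, rewards, optimal_baseline(p, rewards))
```

All three estimators share the same mean, the true gradient p(1 - p)(r1 - r0), while the variance shrinks from no baseline to the value baseline to the optimal baseline. In this toy case the optimal baseline removes the variance entirely, which is a property of the two-action setting and not a general guarantee.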
Automatic segmentation is essential for the brain tumor diagnosis, disease prognosis, and follow-up therapy of patients with gliomas. Still, accurate detection of gliomas and their sub-regions in multimodal MRI is very challenging due to the variety of scanners and imaging protocols. Over the last years, the BraTS Challenge has provided a large number of multi-institutional MRI scans as a benchmark for glioma segmentation algorithms. This paper describes our contribution to the BraTS 2022 Continuous Evaluation challenge. We propose a new ensemble of multiple deep learning frameworks, namely DeepSeg, nnU-Net, and DeepSCAN, for automatic glioma boundary detection in pre-operative MRI. It is worth noting that our ensemble models took first place in the final evaluation on the BraTS testing dataset with Dice scores of 0.9294, 0.8788, and 0.8803, and Hausdorff distances of 5.23, 13.54, and 12.05, for the whole tumor, tumor core, and enhancing tumor, respectively. Furthermore, the proposed ensemble method ranked first in the final ranking on another unseen test dataset, namely the Sub-Saharan Africa dataset, achieving mean Dice scores of 0.9737, 0.9593, and 0.9022, and HD95 of 2.66, 1.72, and 3.32 for the whole tumor, tumor core, and enhancing tumor, respectively. The docker image for the winning submission is publicly available at (https://hub.docker.com/r/razeineldin/camed22).
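For readers unfamiliar with the Dice score reported in the abstract above, the following is a minimal sketch of the metric on flattened binary masks. It is not the BraTS evaluation code; the function name `dice_score`, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical flattened masks (1 = tumor voxel, 0 = background).
predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 1, 1]

score = dice_score(predicted_mask, reference_mask)
```

Here two voxels overlap out of three predicted and three reference voxels, so the score is 2 * 2 / 6, roughly 0.67; a score of 1.0 would mean the masks agree exactly.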
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.